Abstract

Multilayer perceptrons (MLPs) with one hidden layer have long been
used for non-linear regression. However, in some tasks, MLPs are
overly powerful models, and a small mean square error (MSE) may be
due more to overfitting than to actual modelling. If the noise of the
regression model is Gaussian, the overfitting of the model is totally
determined by the behavior of the likelihood ratio test statistic
(LRTS). However, in numerous cases the assumption of normality of the
noise is arbitrary, if not false. In this paper, we present a
universal bound for the overfitting of such models under weak
assumptions; this bound is valid without Gaussian or identifiability
assumptions. The main application of this bound is to give a hint
about determining the true architecture of the MLP model when the
number of data goes to infinity. As an illustration, we use this
theoretical result to propose and compare effective criteria to find
the true architecture of an MLP.

1 Introduction 



Feed-forward neural networks are well-known and popular tools for
non-linear regression. MLP models can be described as a parametric
family of regression functions. White [8] reviews statistical properties of
MLP estimation in detail. However, he leaves an important question pending,
namely the asymptotic behavior of the estimator when the MLP in use has
redundant hidden units. If we assume that the noise is Gaussian, it is well
known that the least square estimator (LSE) and the maximum likelihood
estimator (MLE) are equivalent, and Amari et al. [1] give several examples of
the behavior of the MLE in the case of redundant hidden units. Moreover, if n
is the number of observations, Fukumizu [2] shows that, for unbounded
parameters and Gaussian noise, the LRTS can have an order lower bounded by
$O(\log(n))$ instead of the classical convergence property to a $\chi^2$ law.
In the same spirit, Hagiwara and Fukumizu [3] investigate the relation
between LRTS divergence and weight size in a simple neural network regression
problem. Hence, if the parameters of MLP models are not bounded, these papers
show that, for Gaussian noise, the overfitting is strong, even if the number
of data is large. Note that this is no longer the case if the parameters of
the MLP are assumed to be a priori bounded.

However, even if the Gaussian assumption for the noise is standard, it may
not be suitable for some models. This assumption is false, for example, when
the range of observations is known to be bounded, since Gaussian variables
can be arbitrarily large in absolute value, even if the probability of such
events is small. Hence, we need a theory which evaluates the overfitting of
MLP regression without knowing the density of the noise and which works
even if the model is not identifiable.

In this paper, we prove an inequality bounding the MSE difference between
the true model and an over-determined model. This inequality shows
that, under suitable assumptions, the asymptotic overfitting of the MSE is
upper bounded by the maximum of the square of a Gaussian process. Moreover,
this bound shows that suitably penalized MSE criteria make it possible to
select the true model asymptotically. The paper is organized as follows: in
section 2 we state the model, section 3 presents our main inequality, and in
section 4 we apply this inequality to select the optimal architecture of the
MLP model. Finally, in section 5, a little experiment gives us some insight
into applying these theoretical results to real-life problems.

2 The model 

Let $x = (x(1), \dots, x(d))^T \in \mathbb{R}^d$ be the vector of inputs and
$w_i := (w_{i1}, \dots, w_{id})^T \in \mathbb{R}^d$ be a parameter vector for
the hidden unit i. The MLP function with k hidden units can be written:

$$f_\theta(x) = \beta + \sum_{i=1}^{k} a_i \, \phi\left(w_i^T x + b_i\right),$$

with $\theta = (\beta, a_1, \dots, a_k, b_1, \dots, b_k, w_{11}, \dots, w_{1d}, \dots, w_{k1}, \dots, w_{kd})$
the parameter vector of the model and $\phi$ a bounded transfer function,
usually a sigmoidal function. Note that we consider only real functions; the
extension to vectorial functions is straightforward but not discussed in this
paper. Let $\Theta_k \subset \mathbb{R}^{k \times (d+2)+1}$ be a compact (i.e.
closed and bounded) set of possible parameters. We consider the regression
model $S = \{f_\theta(y, x),\ \theta \in \Theta_k\}$ with

$$Y = f_\theta(X) + \varepsilon \tag{1}$$

X is a random input variable and $\varepsilon$ is the noise of the model. Let
n be a strictly positive integer. We assume that the observed data
$(x_1, y_1), \dots, (x_n, y_n)$ come from a true model
$(X_i, Y_i)_{i \in \mathbb{N}, i > 0}$ of which the true regression function
is $f_{\theta^0}$, for a $\theta^0$ (possibly not unique) in the interior of
$\Theta_k$.
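
The MLP function defined above can be evaluated in a few lines. The following is a minimal Python sketch and not the author's implementation: the names `mlp` and `n_parameters`, the argument layout and the numeric values are assumptions chosen for illustration, and tanh is used as one possible bounded transfer function.

```python
import math


def mlp(x, beta, a, w, b, phi=math.tanh):
    """Evaluate f_theta(x) = beta + sum_i a[i] * phi(w[i] . x + b[i]).

    x is a list of d inputs, a and b are lists of k output weights and
    biases, and w is a list of k weight vectors of length d.
    """
    total = beta
    for a_i, w_i, b_i in zip(a, w, b):
        pre_activation = sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) + b_i
        total += a_i * phi(pre_activation)
    return total


def n_parameters(k, d):
    """Dimension of theta for k hidden units and d inputs: k * (d + 2) + 1."""
    return k * (d + 2) + 1


# Hypothetical parameter values, used only to show the call.
value = mlp([0.5, -1.0], beta=1.5, a=[1.0, 2.0],
            w=[[6.0, -2.0], [-1.0, 3.0]], b=[0.0, 8.0])
print(value, n_parameters(k=2, d=2))
```

The count returned by `n_parameters` matches the dimension of the parameter set $\Theta_k$ stated above.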

2.1 Estimation of MLP regression model 

The main goal of non-linear regression is to give an estimation of the true
parameter $\theta^0$ based on observations $((x_1, y_1), \dots, (x_n, y_n))$.
This can be done by minimizing the Mean Square Error (MSE) function:

$$E_n(\theta) := \frac{1}{n} \sum_{t=1}^{n} \left(y_t - f_\theta(x_t)\right)^2 \tag{2}$$

with respect to the parameter vector $\theta \in \Theta_k$. The parameter
vector $\hat{\theta}_n$ realizing the minimum will be called the Least Square
Estimator (LSE). Note that parameters realizing the true distribution function
may belong to a sub-manifold of non-null dimension if the number of hidden
units is overestimated. Suppose, for example, that we have a multilayer
perceptron with two hidden units and the true function $f_{\theta^0}$ is given
by a perceptron with only one hidden unit, say
$f_{\theta^0}(x) = a_0 \tanh(w_0 x)$, with $x \in \mathbb{R}$. Then, any
parameter $\theta$ in the set:

$$\{\theta \mid w_2 = w_1 = w_0,\ b_2 = b_1 = 0,\ a_1 + a_2 = a_0\}$$

realizes the function $f_{\theta^0}$. Hence, classical statistical theory for
studying the LSE cannot be applied because it requires the identification of
the parameters (up to a permutation).
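
The loss of identifiability in this example can be checked numerically. The sketch below is only an illustration under assumed values: `a0`, `w0` and the candidate pairs `(a1, a2)` are hypothetical numbers, and the function names are not taken from the paper.

```python
import math


def two_unit_mlp(x, a1, a2, w1, w2, b1=0.0, b2=0.0):
    """MLP with two tanh hidden units, one real input and no output bias."""
    return a1 * math.tanh(w1 * x + b1) + a2 * math.tanh(w2 * x + b2)


# Hypothetical true one-unit function f(x) = a0 * tanh(w0 * x).
a0, w0 = 2.0, 0.7


def true_function(x):
    return a0 * math.tanh(w0 * x)


# Three different parameter vectors, all with w1 = w2 = w0, b1 = b2 = 0
# and a1 + a2 = a0.
candidates = [(2.0, 0.0), (0.5, 1.5), (-3.0, 5.0)]
grid = [i / 10.0 for i in range(-50, 51)]

max_gap = max(
    abs(two_unit_mlp(x, a1, a2, w0, w0) - true_function(x))
    for a1, a2 in candidates
    for x in grid
)
print(max_gap)  # zero up to floating-point rounding
```

All three parameter vectors give the same regression function on the grid, so the MSE cannot distinguish between them.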

In the next section, we will compare the MSE of over-determined models
against the MSE of the true model:

$$\frac{1}{n} \sum_{t=1}^{n} \left(y_t - f_\theta(x_t)\right)^2 - \frac{1}{n} \sum_{t=1}^{n} \left(y_t - f_{\theta^0}(x_t)\right)^2 = E_n(\theta) - E_n(\theta^0). \tag{3}$$

3 A general bound for the MSE

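For a square integrable function g(X, Y), the L2 norm is:

$$\|g(X, Y)\|_2 := \sqrt{\int g^2(x, y) \, dP(x, y)}. \tag{4}$$

Now, for $\lambda > 0$, let us define the generalized derivative function:

$$d_\theta(x, y) := \frac{\dfrac{e^{-\lambda (y - f_\theta(x))^2} - e^{-\lambda (y - f_{\theta^0}(x))^2}}{e^{-\lambda (y - f_{\theta^0}(x))^2}}}{\left\| \dfrac{e^{-\lambda (Y - f_\theta(X))^2} - e^{-\lambda (Y - f_{\theta^0}(X))^2}}{e^{-\lambda (Y - f_{\theta^0}(X))^2}} \right\|_2} = \frac{e^{-\lambda \left((y - f_\theta(x))^2 - (y - f_{\theta^0}(x))^2\right)} - 1}{\left\| e^{-\lambda \left((Y - f_\theta(X))^2 - (Y - f_{\theta^0}(X))^2\right)} - 1 \right\|_2} \tag{5}$$

and let us define $(d_\theta)_-(x, y) = \min\{0, d_\theta(x, y)\}$. Note that
the generalized derivative function converges toward the derivative function
if $\theta$ converges toward $\theta^0$.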

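For now, let us assume that $d_\theta$ is well defined; this point will be
discussed later. We can state the following inequality:
Inequality:

for $\lambda > 0$,

$$\sup_{\theta \in \Theta_k} n \left(E_n(\theta^0) - E_n(\theta)\right) \le \frac{1}{2\lambda} \sup_{\theta \in \Theta_k} \frac{\left(\sum_{t=1}^{n} d_\theta(x_t, y_t)\right)^2}{\sum_{t=1}^{n} \left((d_\theta)_-(x_t, y_t)\right)^2}. \tag{6}$$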



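Proof:

We have

$$n \left(E_n(\theta^0) - E_n(\theta)\right) = \frac{1}{\lambda} \sum_{t=1}^{n} \log\left(1 + \left\| \frac{e^{-\lambda (Y - f_\theta(X))^2} - e^{-\lambda (Y - f_{\theta^0}(X))^2}}{e^{-\lambda (Y - f_{\theta^0}(X))^2}} \right\|_2 d_\theta(x_t, y_t)\right)$$

$$\le \sup_{0 \le p \le \left\| \frac{e^{-\lambda (Y - f_\theta(X))^2} - e^{-\lambda (Y - f_{\theta^0}(X))^2}}{e^{-\lambda (Y - f_{\theta^0}(X))^2}} \right\|_2} \frac{1}{\lambda} \left(p \sum_{t=1}^{n} d_\theta(x_t, y_t) - \frac{p^2}{2} \sum_{t=1}^{n} \left((d_\theta)_-(x_t, y_t)\right)^2\right)$$

$$\le \sup_{p \ge 0} \frac{1}{\lambda} \left(p \sum_{t=1}^{n} d_\theta(x_t, y_t) - \frac{p^2}{2} \sum_{t=1}^{n} \left((d_\theta)_-(x_t, y_t)\right)^2\right),$$

since for any real number $u > -1$, $\log(1 + u) \le u - \frac{1}{2} u_-^2$
with $u_- = \min\{0, u\}$. Finally, replacing p by the optimal value, we find

$$n \left(E_n(\theta^0) - E_n(\theta)\right) \le \frac{1}{2\lambda} \frac{\left(\sum_{t=1}^{n} d_\theta(x_t, y_t)\right)^2}{\sum_{t=1}^{n} \left((d_\theta)_-(x_t, y_t)\right)^2}.$$

This inequality makes it possible to prove the tightness of
$n \left(E_n(\theta^0) - E_n(\theta)\right)$ under simple assumptions. It is
used in the next section to prove the consistency of an estimator of the
number of hidden units using a penalized MSE criterion.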
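
The last step of the proof, replacing p by its optimal value, can be spelled out as a short worked computation. The shorthand $S_n$ and $Q_n$ below is introduced here only for readability and is not notation from the paper; the computation assumes $Q_n > 0$.

```latex
% Shorthand (assumption of this note):
%   S_n = \sum_{t=1}^{n} d_\theta(x_t, y_t),
%   Q_n = \sum_{t=1}^{n} \left((d_\theta)_-(x_t, y_t)\right)^2 > 0.
\begin{align*}
g(p) &= p\,S_n - \frac{p^2}{2}\,Q_n, \qquad g'(p) = S_n - p\,Q_n, \\
\text{if } S_n \ge 0:\quad p^* &= \frac{S_n}{Q_n}, \qquad
g(p^*) = \frac{S_n^2}{Q_n} - \frac{S_n^2}{2\,Q_n} = \frac{S_n^2}{2\,Q_n}, \\
\text{if } S_n < 0:\quad \sup_{p \ge 0} g(p) &= g(0) = 0 \le \frac{S_n^2}{2\,Q_n}.
\end{align*}
% In both cases, \sup_{p \ge 0} \frac{1}{\lambda} g(p) \le \frac{1}{2\lambda} \frac{S_n^2}{Q_n}.
```

Since g is a concave quadratic in p, the stationary point is its maximum, which gives the factor $\frac{1}{2\lambda}$ in the bound.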

4 Estimation of the number of hidden units. 

Let $k^0$ be the minimal number of hidden units needed to realize the true
regression function $f_{\theta^0}$. In this section, the set of possible
parameters will be set to

$$\Theta = \bigcup_{k=1}^{K} \Theta_k,$$

where K is a, possibly huge, fixed constant: the maximum number of hidden
units for MLP models. We define the minimum-penalized MSE estimator of
$k^0$ as the minimizer $\hat{k}$ of

$$T_n(k) = \min_{\theta \in \Theta_k} \left(E_n(\theta) + a_n(k)\right) \tag{7}$$

Let us make the following assumptions:

(A1) $a_n(\cdot)$ is increasing, $n \left(a_n(k_1) - a_n(k_2)\right)$ tends to
infinity as n tends to infinity for any $k_1 > k_2$, and $a_n(k)$ tends to 0
as n tends to infinity for any k.

(A2) There exists $\lambda > 0$ such that $\{d_\theta,\ \theta \in \Theta\}$
is a Donsker class (see van der Vaart [7] and the appendix).
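
Condition (A1) can be explored numerically for the two classical penalties discussed in section 5. The sketch below is an illustration under stated assumptions: the penalties are written as $2\sigma^2 D / n$ (AIC like) and $\sigma^2 D \log(n) / n$ (BIC like) with $D = k(d+2)+1$ weights, and the values d = 2, $\sigma^2 = 1$, $k_1 = 4$, $k_2 = 3$ are arbitrary choices.

```python
import math


def n_weights(k, d=2):
    """Number of weights of an MLP with k hidden units and d inputs."""
    return k * (d + 2) + 1


def aic_like(n, k, sigma2=1.0):
    return 2.0 * sigma2 * n_weights(k) / n


def bic_like(n, k, sigma2=1.0):
    return sigma2 * n_weights(k) * math.log(n) / n


def scaled_gap(penalty, n, k1=4, k2=3):
    """Quantity n * (a_n(k1) - a_n(k2)) that (A1) requires to diverge."""
    return n * (penalty(n, k1) - penalty(n, k2))


for n in (100, 10_000, 1_000_000):
    print(n, scaled_gap(aic_like, n), scaled_gap(bic_like, n))
```

Under these assumptions the AIC like gap stays constant in n, so the divergence required by (A1) fails, while the BIC like gap grows like $\log(n)$ and both penalties tend to 0.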

We now have:
Theorem:

Under (A1) and (A2), $\hat{k}$ converges in probability to the true number of
hidden units $k^0$.

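Proof:

By applying the inequality,

$$P(\hat{k} > k^0) \le \sum_{k=k^0+1}^{K} P\left(T_n(k) \le T_n(k^0)\right)$$

$$= \sum_{k=k^0+1}^{K} P\left(n \left(\min_{\theta \in \Theta_{k^0}} E_n(\theta) - \min_{\theta \in \Theta_k} E_n(\theta)\right) \ge n \left(a_n(k) - a_n(k^0)\right)\right)$$

$$\le \sum_{k=k^0+1}^{K} P\left(\frac{1}{2\lambda} \sup_{\theta \in \Theta_k} \frac{\left(\sum_{t=1}^{n} d_\theta(x_t, y_t)\right)^2}{\sum_{t=1}^{n} \left((d_\theta)_-(x_t, y_t)\right)^2} \ge n \left(a_n(k) - a_n(k^0)\right)\right).$$

Now, under (A2),

$$\sup_{\theta \in \Theta_k} \left(\frac{1}{\sqrt{n}} \sum_{t=1}^{n} d_\theta(x_t, y_t)\right)^2 = O_P(1)$$

where $O_P(1)$ means bounded in probability. Moreover, under (A2) the set
$\{((d_\theta)_-)^2,\ \theta \in \Theta_k\}$ is Glivenko-Cantelli (the set
admits a uniform law of large numbers). Hence

$$\inf_{\theta \in \Theta_k} \frac{1}{n} \sum_{t=1}^{n} \left((d_\theta)_-(x_t, y_t)\right)^2 \longrightarrow \inf_{\theta \in \Theta_k} \left\| (d_\theta)_-(X, Y) \right\|_2^2.$$

But $\inf_{\theta \in \Theta_k} \| (d_\theta)_-(X, Y) \|_2 > 0$, since the
random variable $d_\theta(X, Y)$ is centered and $\|d_\theta(X, Y)\|_2 = 1$.
Then, we get:

$$\sup_{\theta \in \Theta_k} \frac{\left(\sum_{t=1}^{n} d_\theta(x_t, y_t)\right)^2}{\sum_{t=1}^{n} \left((d_\theta)_-(x_t, y_t)\right)^2} = O_P(1)$$

and $P(\hat{k} > k^0)$ tends to 0 as n tends to infinity. Finally,

$$P(\hat{k} < k^0) \le \sum_{k=1}^{k^0-1} P\left(\sup_{\theta \in \Theta_k} \left(E_n(\theta^0) - E_n(\theta)\right) \ge a_n(k) - a_n(k^0)\right)$$

and $\sup_{\theta \in \Theta_k} \left(E_n(\theta^0) - E_n(\theta)\right)$
converges in probability to

$$\sup_{\theta \in \Theta_k} \left(\|Y - f_{\theta^0}(X)\|_2^2 - \|Y - f_\theta(X)\|_2^2\right) < 0$$

since $k < k^0$, so $\hat{k}$ converges in probability to $k^0$. ■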

The assumption (A1) is fairly standard for model selection; in the Gaussian
case, (A1) will be fulfilled by the BIC criterion. The assumption (A2)
is more difficult to check. First we note:

$$\left(e^{-\lambda \left((Y - f_\theta(X))^2 - (Y - f_{\theta^0}(X))^2\right)} - 1\right)^2 = e^{-2\lambda \left((Y - f_\theta(X))^2 - (Y - f_{\theta^0}(X))^2\right)} - 2 e^{-\lambda \left((Y - f_\theta(X))^2 - (Y - f_{\theta^0}(X))^2\right)} + 1.$$

So, $d_\theta$ is well defined if
$E\left[e^{-2\lambda \left((Y - f_\theta(X))^2 - (Y - f_{\theta^0}(X))^2\right)}\right] < \infty$, but

$$(Y - f_\theta(X))^2 - (Y - f_{\theta^0}(X))^2 = \left(Y - f_{\theta^0}(X) + f_{\theta^0}(X) - f_\theta(X)\right)^2 - (Y - f_{\theta^0}(X))^2$$

$$= 2\varepsilon \left(f_{\theta^0}(X) - f_\theta(X)\right) + \left(f_{\theta^0}(X) - f_\theta(X)\right)^2$$

where $\varepsilon = Y - f_{\theta^0}(X)$ is the noise of the model. Since an
MLP function is bounded, $d_\theta$ is well defined if there exists
$\lambda > 0$ such that $E\left[e^{\lambda |\varepsilon|}\right] < \infty$,
i.e. $\varepsilon$ admits exponential moments. Finally, using the same
reparameterization techniques as in Rynkiewicz [6], assumption (A2) can be
shown to be true for MLP models with sigmoidal transfer functions, if the set
of possible parameters is compact.

5 A little experiment 

The theoretical penalization terms of the previous section can be chosen
among a wide range of functions (see condition A1). In the sequel, a little
experiment is conducted to assess the right rate of penalization to guess the
"true" architecture of a model.
Consider a simulated model:

$$Z_t = F_{\theta^0}(X_t, Y_t) + \varepsilon_t, \quad t = 1, \dots, n,$$

with $((X_1, Y_1), \dots, (X_n, Y_n))$ i.i.d.,
$(X_t, Y_t) \sim \mathcal{N}(0_{\mathbb{R}^2}, 3 \cdot I_2)$,
$(\varepsilon_1, \dots, \varepsilon_n)$ i.i.d.,
$\varepsilon_t \sim \mathcal{U}[-1, 1]$, the uniform law on [-1, 1], and

$$F_{\theta^0}(x, y) = \tanh(6 \cdot x - 2 \cdot y) + 2 \cdot \tanh(8 - x + 3 \cdot y) - 3 \cdot \tanh(2 - 6 \cdot x - 2 \cdot y) + 1.5. \tag{8}$$

Here, the true model is an MLP with 2 inputs, 3 hidden units and one output.
To keep the computation time reasonable, the number of hidden units
is assumed to be between 1 and 10. We estimate the true architecture of the
MLP according to (7).
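
For readers who want to reproduce the setting, the simulated model can be sketched as follows. This is a minimal sketch under assumptions, not the author's code: the names `f_true` and `simulate` and the seed are hypothetical, and $3 \cdot I_2$ is read as the covariance matrix of the inputs, so each input has standard deviation $\sqrt{3}$.

```python
import math
import random


def f_true(x, y):
    """True regression function (8): an MLP with 2 inputs and 3 hidden units."""
    return (math.tanh(6 * x - 2 * y)
            + 2 * math.tanh(8 - x + 3 * y)
            - 3 * math.tanh(2 - 6 * x - 2 * y)
            + 1.5)


def simulate(n, seed=0):
    """Draw n triples (x, y, z) with z = f_true(x, y) + uniform noise on [-1, 1]."""
    rng = random.Random(seed)
    sd = math.sqrt(3.0)  # assumption: 3 * I_2 is the covariance matrix
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, sd)
        y = rng.gauss(0.0, sd)
        eps = rng.uniform(-1.0, 1.0)
        data.append((x, y, f_true(x, y) + eps))
    return data


sample = simulate(500)
noise_var = sum((z - f_true(x, y)) ** 2 for x, y, z in sample) / len(sample)
print(len(sample), noise_var)
```

The variance of the uniform law on [-1, 1] is 1/3, so the empirical noise variance printed above should be close to that value.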

First, let us write the log-likelihood of the data as if the density of the
noise were Gaussian:

$$\log L_\theta\left((x_1, y_1, z_1), \dots, (x_n, y_n, z_n)\right) = -\frac{n}{2} \log(2\pi\sigma^2) - \sum_{t=1}^{n} \frac{1}{2\sigma^2} \left(z_t - F_\theta(x_t, y_t)\right)^2 + \sum_{t=1}^{n} \log g(x_t, y_t) \tag{9}$$

Here, we assume that the variance of the noise $\sigma^2$ is a known constant
and that the density of the explicative variables (X, Y) is a function
g(., .) independent of the parameter vector $\theta$. Two classical criteria
are:

$$\text{AIC}: \ \sum_{t=1}^{n} \frac{1}{\sigma^2} \left(z_t - F_\theta(x_t, y_t)\right)^2 + 2 \cdot D + Cte$$
$$\text{BIC}: \ \sum_{t=1}^{n} \frac{1}{\sigma^2} \left(z_t - F_\theta(x_t, y_t)\right)^2 + D \cdot \log(n) + Cte \tag{10}$$

where D is the size of the parameter vector (the dimension of the model or
the number of weights of the MLP) and Cte is a constant independent of the
parameter $\theta$.

The optimization of the log-likelihood is done with respect to the parameter
$\theta$, so maximizing this quantity is equivalent to minimizing:

$$E_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_\theta(x_t, y_t)\right)^2 \tag{11}$$

Hence, an AIC like minimum-penalized MSE criterion would be:

$$\frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_\theta(x_t, y_t)\right)^2 + \frac{2\sigma^2 D}{n}$$

and a BIC like minimum-penalized MSE criterion would be:

$$\frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_\theta(x_t, y_t)\right)^2 + \frac{\sigma^2 D \log(n)}{n}$$

Note that these criteria involve the knowledge of the variance of the noise
$\sigma^2$. In a first experiment, we will use the true variance of the
noise; this problem is then addressed in the following sections.
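
The two penalized criteria and the resulting choice of architecture can be sketched in Python. Everything below is illustrative: the function names are hypothetical, and the MSE values in `mse_by_k` are made-up numbers standing in for the minimal MSE reached with k hidden units, not results from the paper. The noise variance 1/3 is the variance of the uniform law on [-1, 1].

```python
import math


def n_weights(k, d=2):
    """D: number of weights of an MLP with k hidden units and d inputs."""
    return k * (d + 2) + 1


def aic_like(mse, k, n, sigma2):
    return mse + 2.0 * sigma2 * n_weights(k) / n


def bic_like(mse, k, n, sigma2):
    return mse + sigma2 * n_weights(k) * math.log(n) / n


def select_k(mse_by_k, n, sigma2, criterion):
    """Return the number of hidden units minimizing the penalized criterion."""
    return min(mse_by_k, key=lambda k: criterion(mse_by_k[k], k, n, sigma2))


# Made-up minimal MSE values for k = 1..5 (illustration only).
mse_by_k = {1: 2.10, 2: 0.95, 3: 0.340, 4: 0.330, 5: 0.322}
sigma2 = 1.0 / 3.0  # variance of the uniform law on [-1, 1]
n = 500

print(select_k(mse_by_k, n, sigma2, aic_like))
print(select_k(mse_by_k, n, sigma2, bic_like))
```

With these made-up values, the small decreases in MSE beyond k = 3 are enough to offset the AIC like penalty but not the BIC like one, which is the kind of behavior the experiment below examines.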

In the following sections, the optimization of the MLP is done with the
Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. In order to avoid bad local
minima, 10 random initializations of the weights are done for each
estimation.



5.1 Model selection with $\sigma^2$ known

We will compare 4 criteria, from the least penalized (AIC like) to the most
penalized (Very Strong Penalization). The following penalized criteria are
assessed:

• AIC like: $\frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_\theta(x_t, y_t)\right)^2 + \frac{2\sigma^2 D}{n}$

• BIC like: $\frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_\theta(x_t, y_t)\right)^2 + \frac{\sigma^2 D \log(n)}{n}$

• SP (Strong Penalization): $\frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_\theta(x_t, y_t)\right)^2 +$ [...]

• VSP (Very Strong Penalization): $\frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_\theta(x_t, y_t)\right)^2 +$ [...]

We simulate n = 100, n = 500 and n = 1000 data according to the true
model (8); for each n the experiment is repeated 100 times.

The following architectures are chosen by the penalized criteria : 

• n=100

            nb h. units    1    2    3    4    5    6    7    8    9   10
AIC like    models sel.    -    -   47   35   10    2    3    3    -    -
BIC like    models sel.    -    -  100    -    -    -    -    -    -    -
SP          models sel.    -    -  100    -    -    -    -    -    -    -
VSP         models sel.    -   74   26    -    -    -    -    -    -    -

• n=500

            nb h. units    1    2    3    4    5    6    7    8    9   10
AIC like    models sel.    -    -    3   14   17   14   13   15   10   13
BIC like    models sel.    -    -  100    -    -    -    -    -    -    -
SP          models sel.    -    -  100    -    -    -    -    -    -    -
VSP         models sel.    -    1   99    -    -    -    -    -    -    -

• n=1000

            nb h. units    1    2    3    4    5    6    7    8    9   10
AIC like    models sel.    -    -    -    -    -    -    -    7   24   69
BIC like    models sel.    -    -  100    -    -    -    -    -    -    -
SP          models sel.    -    -  100    -    -    -    -    -    -    -
VSP         models sel.    -    -  100    -    -    -    -    -    -    -

The BIC like criterion and the Strong Penalization always chose the true
architecture, whatever the number of data. In accordance with the theory, the
AIC like criterion is not consistent (see condition A1) and the chosen
architecture is too large in most cases. The Very Strong Penalization chose
too small an architecture when the number of data is small (n = 100);
however, it is a consistent criterion, so its behavior is correct for larger
numbers of data (n = 500 and n = 1000). These good results assume that the
true variance of the noise is known, but for regression models this is never
the case. A first idea may be to replace the unknown variance by the
estimated one; this is done in the next section.

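5.2 Model selection using estimated variance $\hat{\sigma}^2$

The estimated variance $\hat{\sigma}^2$ is the mean square error of the model:

$$\hat{\sigma}^2 := E_n(\hat{\theta}) = \frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_{\hat{\theta}}(x_t, y_t)\right)^2$$

computed for the least square estimator $\hat{\theta}$. Hence, the comparison
is done with the penalized criteria:

• AIC like: $\frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_\theta(x_t, y_t)\right)^2 + \frac{2\hat{\sigma}^2 D}{n}$

• BIC like: $\frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_\theta(x_t, y_t)\right)^2 + \frac{\hat{\sigma}^2 D \log(n)}{n}$

• SP (Strong Penalization): $\frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_\theta(x_t, y_t)\right)^2 +$ [...]

• VSP (Very Strong Penalization): $\frac{1}{n} \sum_{t=1}^{n} \left(z_t - F_\theta(x_t, y_t)\right)^2 +$ [...]

The following architectures are chosen by the penalized criteria: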

• n=100

            nb h. units    1    2    3    4    5    6    7    8    9   10
AIC like    models sel.    -    -    -    -    -    1    -    4   27   68
BIC like    models sel.    -    -   22    6    3    1    3    5   15   45
SP          models sel.    -    -   85    3    1    -    -    -    1   10
VSP         models sel.    -    -  100    -    -    -    -    -    -    -

• n=500

            nb h. units    1    2    3    4    5    6    7    8    9   10
AIC like    models sel.    -    -    -    4    4    6    9   23   19   35
BIC like    models sel.    -    -  100    -    -    -    -    -    -    -
SP          models sel.    -    -  100    -    -    -    -    -    -    -
VSP         models sel.    -    -  100    -    -    -    -    -    -    -

• n=1000

            nb h. units    1    2    3    4    5    6    7    8    9   10
AIC like    models sel.    -    -    -    -    3    7   12   16   29   33
BIC like    models sel.    -    -  100    -    -    -    -    -    -    -
SP          models sel.    -    -  100    -    -    -    -    -    -    -
VSP         models sel.    -    -  100    -    -    -    -    -    -    -

As usual, the AIC like criterion misbehaves as in the previous section. But,
for a small number of data (n = 100), the use of an estimate of the variance
of the noise instead of the true one leads to overestimation of the number of
hidden units for the BIC like criterion and the Strong Penalization. The
explanation is that the variance of the noise is underestimated for a large
number of hidden units, and so is the penalization term. This drawback
disappears for larger numbers of data (n = 500 and n = 1000) because the
estimation of the variance becomes better. Although the Very Strong
Penalization criterion seems to guess the true architecture whatever the
number of data, plugging in the estimated variance instead of the true one
seems to be too naive an approach for small numbers of data. As the goal of a
penalized criterion is to compare models, we could use the logarithm of the
mean square error instead of the mean square error itself; the lack of the
true variance is then no longer a problem because this number cancels in the
difference of the logarithms. This approach is studied in the next section.

5.3 Model selection using logarithm of mean square error

Choosing between two numbers of hidden units, $k_1$ and $k_2$, is the result
of a comparison between
$\min_{\theta_1 \in \Theta_{k_1}} \left(E_n(\theta_1) + a_n(k_1)\right)$ on
one hand and
$\min_{\theta_2 \in \Theta_{k_2}} \left(E_n(\theta_2) + a_n(k_2)\right)$ on
the other hand, where $a_n(k)$ is the penalization term. So, the result of
the comparison may be changed if we consider $C \cdot E_n(\theta)$ instead of
$E_n(\theta)$ when C is a very big (or very small) constant. But, if we
compare
$\min_{\theta_1 \in \Theta_{k_1}} \left(\log(E_n(\theta_1)) + a_n(k_1)\right)$
and
$\min_{\theta_2 \in \Theta_{k_2}} \left(\log(E_n(\theta_2)) + a_n(k_2)\right)$,
the result of the comparison is the same if we change $E_n(\theta)$ into
$C \cdot E_n(\theta)$.
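
The effect of rescaling the MSE by a constant C can be seen on a toy comparison. The sketch below uses made-up MSE values and a made-up penalty `0.05 * k`; both are assumptions for illustration only and are not taken from the experiment.

```python
import math


def select(mse_by_k, penalty, transform=lambda value: value):
    """Pick k minimizing transform(MSE) + penalty(k)."""
    return min(mse_by_k, key=lambda k: transform(mse_by_k[k]) + penalty(k))


def penalty(k):
    return 0.05 * k  # hypothetical penalization term


mse_by_k = {2: 0.95, 3: 0.34, 4: 0.33}  # made-up minimal MSE values

raw_choices = []
log_choices = []
for scale in (1.0, 100.0, 0.01):
    scaled = {k: scale * value for k, value in mse_by_k.items()}
    raw_choices.append(select(scaled, penalty))
    log_choices.append(select(scaled, penalty, transform=math.log))

print(raw_choices)  # the choice depends on the scale C
print(log_choices)  # the choice does not depend on the scale C
```

Multiplying the MSE by C adds the same constant $\log(C)$ to every log-criterion, so the ranking of the models is unchanged, whereas the raw criterion trades a rescaled MSE against an unscaled penalty.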

Moreover, $E_n(\theta)$ may be seen as an approximation of the variance of the
noise $\sigma^2$, so $E_n(\theta) = \sigma^2 \cdot (1 - e(\theta))$ and

$$\log\left(E_n(\theta)\right) \simeq \log(\sigma^2) - e(\theta) + o(e(\theta))$$

The term $e(\theta)$ may be seen as the normalized term of overfitting of the
model. Finally, writing $\theta_1$ and $\theta_2$ for the respective
minimizers, we get:

$$\min_{\theta \in \Theta_{k_1}} \left(\log(E_n(\theta)) + a_n(k_1)\right) - \min_{\theta \in \Theta_{k_2}} \left(\log(E_n(\theta)) + a_n(k_2)\right) \simeq e(\theta_2) - e(\theta_1) + a_n(k_1) - a_n(k_2)$$

and the penalization term fully plays its role of compensation of the
"normalized" overfitting.

The results on our little experiment with these new criteria are the fol- 
lowing: 

• n=100

            nb h. units    1    2    3    4    5    6    7    8    9   10
AIC like    models sel.    -    -    3    7    4   11    9   13   24   29
BIC like    models sel.    -    -   96    4    -    -    -    -    -    -
SP          models sel.    -    -  100    -    -    -    -    -    -    -
VSP         models sel.   73   27    -    -    -    -    -    -    -    -

• n=500

            nb h. units    1    2    3    4    5    6    7    8    9   10
AIC like    models sel.    -    -    3    6    8   13   18   10   17   25
BIC like    models sel.    -    -   99    1    -    -    -    -    -    -
SP          models sel.    -    -  100    -    -    -    -    -    -    -
VSP         models sel.    -   69   31    -    -    -    -    -    -    -

• n=1000

            nb h. units    1    2    3    4    5    6    7    8    9   10
AIC like    models sel.    -    -    1    2    7   13   16   21   19   21
BIC like    models sel.    -    -  100    -    -    -    -    -    -    -
SP          models sel.    -    -  100    -    -    -    -    -    -    -
VSP         models sel.    -    2   98    -    -    -    -    -    -    -

We can see that this method yields very good results, whatever the number of
data, for the BIC like criterion and the Strong Penalization, without knowing
the true variance of the noise. The Strong Penalization even seems to be a
little better than the BIC like criterion.

6 Conclusion



MLP regression is widely used and remains a very competitive method (see
Osowski et al. [5]), but theoretical justification is lacking for determining
the true architecture and especially the number of hidden units. Indeed, the
classical asymptotic theory fails when the model is not identifiable. In this
paper, we prove an inequality showing that the overfitting of MLP is moderate
if the noise admits exponential moments and the parameters of the model
are a priori bounded. This bound justifies the use of penalized criteria in
order to fit the architecture of MLP models in the framework of regression
without knowing the density of the noise. Hence, the user can select the
true number of hidden units thanks to penalized criteria of the form

$$E_n(\theta) + a_n(k)$$
or

$$\ln(E_n(\theta)) + a_n(k)$$

If the penalization term $a_n(k)$ is well calibrated
([...] $< a_n(k) < C \cdot k$), the true number of hidden units will be
automatically selected if n is large enough. A little experiment suggests
that a good choice of penalization seems to be the middle of the possible
range: $a_n(k) =$ [...]. The use of the logarithm of the mean square error
$E_n(\theta)$ is an easy way to avoid having to know the true variance of the
noise. A further question is whether this empirical finding for the tuning of
the penalization term can be justified theoretically.

Note that this paper is only concerned with the identification of the
true model. The point is more to have an idea of the complexity of the
model determining the data than to have a predictive model. However, if
there are enough data, the true model will also be the best predictor. Hence,
the spirit of the studied criterion is very different from the approach
used in the Extreme Learning Machine (ELM), which provides a good predictive
model at extremely fast learning speed. Indeed, in ELM, all the hidden node
parameters are independent of the target functions or the training datasets,
so the weights between input and hidden nodes are no longer considered as
parameters. Such a method is not concerned with model identification but
only with predictive power and speed of computing; for example, it is possible
to use ELM even if the relation between inputs and output is linear. Finally,
if ELM seems to work well in practice, the theoretical justification of the
superiority of such a method is still lacking.

Another issue arises when the unknown regression function is not represented
by an MLP at all. In such a case there is no more over-determination
and the asymptotic behaviour of the model is more underfitting than
overfitting. The overfitting occurs only when we consider a finite number of
data, and the theory to deal with such a problem is the very difficult
non-asymptotic statistical theory as in Massart [4]. As far as we know, this
theory gives no hints for choosing the number of hidden units for a finite
number of data.

Appendix 

A Donsker class is a notion from the "empirical processes theory". This
theory deals with the "law of large numbers" and "asymptotic normality" for
sets of functions. Basically, a Donsker class is a set of functions which is
"not too big".

Let $X_1, \dots, X_n$ be a random sample from a probability distribution P.
The empirical distribution is the discrete uniform measure on the
observations. We denote it by $P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$,
where $\delta_x$ is the probability distribution that is degenerate at x.
Given a function f, we write $P_n f$ for the expectation of f under the
empirical measure and $Pf$ for the expectation under P. Thus

$$P_n f = \frac{1}{n} \sum_{i=1}^{n} f(X_i) \quad \text{and} \quad Pf = \int f \, dP$$

The empirical process evaluated at f is defined as
$G_n f = \sqrt{n} \left(P_n f - Pf\right)$. Consider a set of functions
$\mathcal{F}$ endowed with the L2 norm $\|\cdot\|$ (see (4)). For every
$\varepsilon > 0$, we define an $\varepsilon$-bracket by
$[l, u] = \{f \in \mathcal{F},\ l \le f \le u\}$ such that
$\|u - l\| < \varepsilon$. The $\varepsilon$-bracketing entropy is

$$H_{[\cdot]}(\varepsilon, \mathcal{F}, \|\cdot\|) = \ln\left(N_{[\cdot]}(\varepsilon, \mathcal{F}, \|\cdot\|)\right),$$

where $N_{[\cdot]}(\varepsilon, \mathcal{F}, \|\cdot\|)$ is the minimum number
of $\varepsilon$-brackets necessary to cover $\mathcal{F}$.
$N_{[\cdot]}(\varepsilon, \mathcal{F}, \|\cdot\|)$ is also called "covering
number".
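
The empirical measure and the empirical process defined above can be computed directly for simple functions. The sketch below is an illustration under assumptions: the sample is drawn from the uniform law on [-1, 1] (chosen because its moments are known exactly), and the function names and the seed are hypothetical.

```python
import math
import random


def empirical_mean(f, sample):
    """P_n f: average of f over the observations."""
    return sum(f(x) for x in sample) / len(sample)


def empirical_process(f, sample, expectation):
    """G_n f = sqrt(n) * (P_n f - P f), with P f supplied by the caller."""
    return math.sqrt(len(sample)) * (empirical_mean(f, sample) - expectation)


rng = random.Random(0)
sample = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]

# For the uniform law on [-1, 1]: P f = 0 for f(x) = x and 1/3 for f(x) = x^2.
g_identity = empirical_process(lambda x: x, sample, 0.0)
g_square = empirical_process(lambda x: x * x, sample, 1.0 / 3.0)
print(g_identity, g_square)
```

The scaling by $\sqrt{n}$ keeps both values of order one as n grows, which is the behavior that the Donsker property extends uniformly over a whole class of functions.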

The class $\mathcal{F}$ of functions is called Donsker if the covering number,
which depends on the diameter $\varepsilon$ of the balls, is smaller than
order [...] when $\varepsilon$ goes to 0. Then, the sequence of processes
$\{G_n f,\ f \in \mathcal{F}\}$ converges in distribution to a tight limit
process. Finally, if $\mathcal{F} = \{d_\theta,\ \theta \in \Theta\}$ is a
Donsker class, the sequence of processes

$$\left\{ \frac{1}{\sqrt{n}} \sum_{i=1}^{n} d_\theta(x_i, y_i),\ \theta \in \Theta \right\}$$

converges in distribution to a tight random Gaussian process. We get then
the key property needed for the demonstration of the theorem: